Stempelator: A Hybrid Stemmer for the Polish Language
Author: Dawid Weiss
Abstract
Stemming, that is, the conflation of inflected forms of a word to a unique token, is a very popular tool in Information Retrieval and Data Mining. It often improves the quality of results and decreases the storage requirements of the processed information. Quite a few open source or free Polish stemmers have emerged recently. They can be divided into two groups depending on the method used for conflating inflected terms: heuristic and dictionary-driven. In this work we propose a hybrid stemmer that connects the two worlds. We show that a combination of two open source Polish stemmers, Lametyzator and Stempel, results in a stemmer that outperforms both in terms of stemming quality. We also provide an analysis of experiments with different sizes and samples of data used to train Stempel, the heuristic part of Stempelator. We hope this analysis will be useful for further development of that stemmer as a stand-alone product.

BibTEX entry:
@techreport{stempelator2005,
  author = "Dawid Weiss",
  title = "{Stempelator: A Hybrid Stemmer for the Polish Language}",
  institution = "Institute of Computing Science, Pozna{\'n} University of Technology, Poland",
  type = "Technical Report",
  number = "RA-002/05",
  year = "2005"
}

04/02/05 00:24 CVS: Id: stempelator.tex,v 1.3 2005/02/03 21:38:10 dweiss Exp
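The abstract does not say how the dictionary-driven and heuristic parts are combined. As a minimal sketch of one plausible scheme, assuming the dictionary is consulted first and the heuristic stemmer is used only as a fallback for unknown words, the idea could look like the Python below. The names `make_hybrid_stemmer`, `TOY_DICTIONARY` and `toy_suffix_stripper` are hypothetical, and the toy data and suffix list are illustrative only; none of this reflects the actual Lametyzator or Stempel implementations.

```python
from typing import Callable, Mapping


def make_hybrid_stemmer(
    dictionary: Mapping[str, str],
    heuristic: Callable[[str], str],
) -> Callable[[str], str]:
    """Build a stemmer that prefers dictionary entries over heuristics.

    Assumption: dictionary first, heuristic fallback. The source abstract
    does not state the actual combination strategy.
    """

    def stem(word: str) -> str:
        key = word.lower()
        # Known inflected form: trust the dictionary.
        if key in dictionary:
            return dictionary[key]
        # Unknown form: fall back to the heuristic stemmer.
        return heuristic(key)

    return stem


# Hypothetical toy dictionary mapping inflected forms to base forms.
TOY_DICTIONARY = {"psy": "pies", "psa": "pies", "domu": "dom"}


def toy_suffix_stripper(word: str) -> str:
    """Hypothetical stand-in for a heuristic stemmer: strips one suffix."""
    for suffix in ("ami", "ach", "om", "ów", "y", "a", "u"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stem = make_hybrid_stemmer(TOY_DICTIONARY, toy_suffix_stripper)
```

With this toy setup, `stem("psy")` is resolved by the dictionary, while a word missing from it, such as `"kotami"`, is handled by the suffix stripper. The appeal of such a design is that the dictionary covers irregular forms exactly, and the heuristic extends coverage to words the dictionary has never seen.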
Similar resources
MAULIK: An Effective Stemmer for Hindi Language
In this paper, a new stemmer named “Maulik” is proposed for the Hindi language. This stemmer is based purely on the Devanagari script and uses a hybrid approach (a combination of brute force and suffix removal). Stemming can be used to improve the effectiveness of information retrieval. The proposed stemmer is both computationally inexpensive and domain independent. The results are...
Sharif Text Editor: A System for Editing and Spell-Checking the Persian Language
In this paper, we will introduce an intelligent system to edit and spell check Persian texts. The goal is editing and preprocessing Persian texts for natural language processing tasks. This system is based on an expandable and engineering approach and is composed of three subsystems: Persian text editor, spell checker and stemmer. These parts interact with each other to edit texts. To do this, ...
A new hybrid stemming algorithm for Persian
Stemming has been an influential part of information retrieval and search engines. There have been tremendous endeavours to build stemmers that are both efficient and accurate. Stemmers can use one of three methods: dictionary-based, statistical, or rule-based. This paper aims at building a hybrid stemmer that uses both the dictionary-based method and rule-based...
Stemmers for Tamil Language: Performance Analysis
Stemming is the process of extracting the root word from a given inflected word, and it plays a significant role in numerous applications of Natural Language Processing (NLP). The Tamil language raises several challenges for NLP, since it has richer morphological patterns than other languages. A rule-based light-stemmer approach is proposed in this paper to find the stem word for a given inflectio...
A Light Weight Stemmer for Urdu Language: A Scarce Resourced Language
Stemming is a procedure that conflates morphologically related terms into a single term without doing complete morphological analysis. Urdu language raises several challenges to Natural Language Processing (NLP) largely due to its rich morphology. The core tool of information retrieval (IR) is a Stemmer which reduces a word to its stem form. Due to the diverse nature of Urdu, developing its ste...